Exploratory Data Analysis

In this report the data collected by the Twindle application is explored. This is done by following the steps below, based on "A Comprehensive Guide to Data Exploration" by Sunil Ray of Analytics Vidhya.

  1. Variable Identification
  2. Univariate Analysis
  3. Bi-variate Analysis
  4. Missing Value Treatment
  5. Outlier Treatment

1. Variable Identification

Each variable can either be continuous or categorical. Both need to be analyzed in separate ways. In this chapter we are going to define these types for each variable.

1.1 Insights

The data types seem to be loaded mostly correctly. The exception is 'DOOR_OPEN_STATUS', which should be transformed from a continuous variable into a categorical one with the values open and closed.
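As an illustration of this step, the sketch below splits the columns by type and recodes 'DOOR_OPEN_STATUS'. It assumes the data is held in a pandas DataFrame and that the status is stored as 0 (closed) and 1 (open). The encoding, the helper names and the 'TEMPERATURE' column mentioned in the usage note are assumptions, not taken from the notebook.

```python
import pandas as pd


def split_variable_types(df):
    """Return (continuous, categorical) column names based on dtype."""
    continuous = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return continuous, categorical


def door_status_to_categorical(df, column="DOOR_OPEN_STATUS"):
    """Recode a 0/1 door status into a categorical open/closed column.

    Assumption: 0 means closed and 1 means open.
    """
    out = df.copy()
    out[column] = out[column].map({0: "closed", 1: "open"}).astype("category")
    return out
```

Running split_variable_types before and after the conversion should move 'DOOR_OPEN_STATUS' from the continuous list to the categorical list, while a numeric column such as 'TEMPERATURE' stays continuous.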

2. Univariate Analysis

Univariate analysis is the simplest form of data analysis. Uni means one, so each variable is analyzed separately, without looking at its relation to other variables. This is done by visualizing the data in histograms and boxplots to detect any anomalies or outliers.
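The numbers behind a histogram and a boxplot can also be computed directly, which helps when comparing many columns. The following is a minimal sketch, assuming a pandas Series per variable. The function name and the default of 10 bins are illustrative choices.

```python
import numpy as np
import pandas as pd


def univariate_summary(series, bins=10):
    """Compute histogram counts and boxplot statistics for one variable."""
    values = series.dropna()  # missing values are ignored here
    counts, edges = np.histogram(values, bins=bins)
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    return {
        "histogram_counts": counts,
        "bin_edges": edges,
        "min": values.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values.max(),
    }
```

A large gap between the quartiles and the minimum or maximum is a first hint that a column contains outliers.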

2.1 Continuous Variables

2.2 Categorical Variables

2.3 Insights

The data contains large outliers. When using the IQR outlier removal method, about 20% of the data is dropped. This might indicate that faulty measurements are, or were, being made, or that there are indeed many anomalies.

There is also a difference in which data is recorded for each room. The data reading pipeline produces a dataset that only contains rooms with data for the most important columns, namely all air quality and temperature columns. As a result, only rooms a, b and c of the Postillion hotel and the boardroom of the Big Top are available.

Before dropping the outliers, the data was heavily skewed. This might cause inaccuracies in regression models. To prevent this, normalization techniques might need to be applied.
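One common way to reduce right skew is a log transform. The sketch below measures skewness with pandas and applies log1p only when the skew exceeds a threshold. The threshold of 1.0 is a rule of thumb rather than a value from this analysis, and other techniques (for example a power transform) may suit the data better.

```python
import numpy as np
import pandas as pd


def log_transform_if_skewed(series, threshold=1.0):
    """Apply log1p when the absolute skewness exceeds the threshold.

    Only non-negative data is transformed, since log1p is undefined
    for values at or below -1.
    """
    skew = series.skew()
    non_negative = (series.dropna() >= 0).all()
    if abs(skew) > threshold and non_negative:
        return np.log1p(series)
    return series
```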

3. Bi-variate Analysis

During bi-variate analysis the relation between two variables is explored. This will be done by creating correlation coefficient heatmaps, visualizing the most correlated features via scatterplots and seeing the change in value over time.
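To decide which feature pairs are worth plotting as scatterplots, the correlation matrix can be reduced to a ranked list of pairs. This is a minimal sketch, assuming a pandas DataFrame with numeric feature columns and Pearson correlation, which is the pandas default. The function name is hypothetical.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df, top=3):
    """Return the feature pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top)
```

To compare rooms, the function can be applied to each room's subset of the data separately.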

3.1 Correlation coefficient heatmaps

3.2 Insights

When looking at the heatmaps and scatterplots we can conclude that there are few strongly related features. The strength of the correlations also differs per room, indicating that there might be other factors that need to be considered.

The features of the dataset can be used to define the air quality of a room. The dataset needs to be expanded with data on the factors that cause changes in these features.

4. Missing Value Treatment

During the preparation phase the event-based data is converted to tabular data. This is done by joining the events based on their timestamps. When there are no events within range, this results in NaN values, as can be seen below. During model development a pipeline step could be added to impute the missing data.
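The join used in the preparation phase is not shown in this report. As an illustration of how a timestamp-based join produces NaN values, the sketch below uses pandas merge_asof with a tolerance. The column name "timestamp", the 5 minute tolerance and the function name are assumptions.

```python
import pandas as pd


def join_events(left, right, tolerance="5min"):
    """Join two event tables on the nearest timestamp within a tolerance.

    Rows in `left` without a matching event in `right` get NaN values.
    """
    return pd.merge_asof(
        left.sort_values("timestamp"),
        right.sort_values("timestamp"),
        on="timestamp",
        direction="nearest",
        tolerance=pd.Timedelta(tolerance),
    )
```

A narrower tolerance gives more accurate matches but also more missing values, so the range directly influences how much data has to be imputed later.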

4.1 Insights

The 'DOOR_OPEN_TIMES' column contains the total number of times the door has been opened. Since this count is not reset each day and most of its values are missing, this column can probably be dropped. For the other columns, roughly 5% to 60% of the values are missing. When training a model this data needs to be either dropped or imputed.
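The two options, dropping and imputing, can be sketched with pandas as follows. The 60% cut-off and the choice of median imputation are assumptions for illustration. A model pipeline might use a different strategy.

```python
import pandas as pd


def missing_fraction(df):
    """Fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


def drop_sparse_columns(df, max_missing=0.6):
    """Drop columns whose missing fraction exceeds max_missing."""
    keep = df.columns[df.isna().mean() <= max_missing]
    return df[keep]


def impute_median(df):
    """Fill missing numeric values with the column median."""
    out = df.copy()
    numeric = out.select_dtypes(include="number").columns
    out[numeric] = out[numeric].fillna(out[numeric].median())
    return out
```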

5. Outlier Treatment

For the outlier treatment see the 'iqr_outlier_removal' function. The current outlier removal method drops ~20% of the data points. If this is too much, the quantiles can be widened so that fewer data points are dropped.
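The implementation of 'iqr_outlier_removal' is not included in this report. The sketch below shows how an IQR filter with adjustable quantiles could look. The function name, its parameters and the factor k=1.5 are assumptions and may differ from the actual function.

```python
import pandas as pd


def iqr_filter_sketch(df, columns, lower_q=0.25, upper_q=0.75, k=1.5):
    """Keep rows whose values lie within [Q_low - k*IQR, Q_high + k*IQR].

    Missing values are not treated as outliers, so they remain
    available for imputation.
    """
    values = df[columns]
    q_low = values.quantile(lower_q)
    q_high = values.quantile(upper_q)
    iqr = q_high - q_low
    lower = q_low - k * iqr
    upper = q_high + k * iqr
    inside = (values.ge(lower) & values.le(upper)) | values.isna()
    return df[inside.all(axis=1)]
```

Widening the quantiles, for example to 0.05 and 0.95, enlarges the accepted range and therefore drops fewer rows.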

6. Conclusion

This Exploratory Data Analysis was conducted to see how the data collected by the Twindle application is put together. The following observations have been made:

Based on these observations the data requirements can be researched. There needs to be a single feature that indicates air quality, based on temperature, humidity, etc. Data that signal changes in the current features need to be added to the dataset. Together this can be used to develop a machine learning model that is able to predict the air quality.